Simple Linear Regression

Author
Affiliation

Daniela Palleschi

Humboldt-Universität zu Berlin

Published

April 13, 2023

knitr::opts_chunk$set(eval = T, # evaluate code chunks? (set to F to skip re-running the analyses)
                      echo = T, # print code chunks?
                      message = F, # print messages?
                      error = F, # print errors?
                      warning = F) # print warnings?

1 Set-up environment

1.1 Load packages

# suppress scientific notation
options(scipen=999)

# load libraries
pacman::p_load(tidyverse,
               broom,
               patchwork,
               knitr,
               kableExtra)

1.2 Set seed

  • the set.seed() function is useful when generating random numbers in R
  • it allows us to set a seed for the random number generator
  • otherwise, every time I ran this code (and every time you ran it), we would get different random numbers
    • it therefore allows us to ensure full reproducibility
set.seed(1954)

1.3 Load in data

  • force character variables to factors
  • filter for the verb region from critical items only, remove participant 3, and remove first-fixation values of 0
# load in dataset
df_crit_verb <-
  readr::read_csv(
    here::here("data/tidy_data_lifetime_pilot.csv"),
    # for special characters
    locale = readr::locale(encoding = "latin1")
  ) |>
  mutate_if(is.character, as.factor) |> # all character variables as factor
  filter(type == "critical", # only critical trials
         px != "px3", # px3 had a lot of missing values
         ff > 0, # only values of ff above 0
         region == "verb") %>% # critical region only
  droplevels() # remove any factor levels with no observations

2 Resources

  • these slides are based on a mix of the following resources

DeBruine & Barr (2021); Winter (2013); Winter (2014); Winter (2019)

  • and on slides that were originally based on Field et al. (2013), but if you’re looking for a textbook I’d recommend Winter (2019)

3 (Linear) Regression

  • our data exploration has given us an idea about what our data look like
  • but now we want to be able to make predictions about hypothetical observations, i.e., to predict values of our DV based on one (or more) IV(s)
    • so we fit a model to our data, and use it to predict values of our DV based on one (or more) IV(s)
    • i.e., predicting an outcome variable (DV) from one or more predictors (IVs)
  • because we’re making predictions, we need to take into account the variability (i.e., error) in our data

3.1 Types of regression

tribble(
  ~"regression type", ~"predictor", ~"outcome",
  "simple regression", "Single predictor", "continuous (numerical)",
  "multiple regression", "multiple predictor", "continuous (numerical)",
  "hierarchical/linear mixed models/linear mixed effect models", "include random effect", "continuous (numerical)",
  "generalised linear (mixed) models/logistic regression", "as above","binary/binomial data or count data"
) %>% 
  kable() %>% 
  kable_styling()
| regression type | predictor | outcome |
|---|---|---|
| simple regression | single predictor | continuous (numerical) |
| multiple regression | multiple predictors | continuous (numerical) |
| hierarchical/linear mixed models/linear mixed effect models | include random effects | continuous (numerical) |
| generalised linear (mixed) models/logistic regression | as above | binary/binomial data or count data |

3.2 Straight lines

  • regression summarises the data with a straight line
  • straight lines can be defined by
    • Intercept (\(b_0\))
      • value of \(Y\) when \(X = 0\)
    • Slope (\(b_1\))
      • regression coefficient for the predictor
      • gradient (slope) of the regression line
      • direction/strength of the relationship
  • so we need to define an intercept and a slope

3.2.1 Intercept (\(b_0\))

  • the value of \(y\) when \(x = 0\)

3.2.2 Slopes (\(b_1\))

  • slopes describe a change in \(y\) (\(\Delta y\)) over a change in \(x\) (\(\Delta x\))
    • positive slope: as \(x\) increases, \(y\) increases
    • negative slope: as \(x\) increases, \(y\) decreases
    • if the slope is 0, there is no change in \(y\) as a function of \(x\)
  • or: the change in \(y\) when \(x\) increases by 1 unit
    • sometimes referred to as “rise over run”: how do you ‘rise’ in \(y\) for a given ‘run’ in \(x\)?

\[ slope = \frac{\Delta y}{\Delta x} \]
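As a quick numeric check of "rise over run", here is the slope and intercept of a line through two made-up points (the numbers are purely illustrative, not from the reading-time data):

```r
# two made-up points on a line
x <- c(0, 4)
y <- c(2, 10)

# slope: "rise over run" -- change in y over change in x
slope <- diff(y) / diff(x)        # 8 / 4 = 2

# intercept: value of y when x = 0
intercept <- y[1] - slope * x[1]  # 2 - 2*0 = 2
```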


  • what is the intercept of this line?
  • what is the slope of this line?

3.2.3 A line = intercept and slope

  • a line is defined by its intercept and slope
    • in a regression model, these two are called coefficients

Image source: Winter (2019) (all rights reserved)

3.2.4 Error and residuals

  • fixed effects (IV/predictors): things we can understand/measure
  • error (random effects): things we cannot understand/measure
    • in biology, social sciences (and linguistic research), there will always be sources of random error that we cannot account for
    • random error is less an issue in e.g., physics (e.g., measuring gravitational pull)
  • residuals: the difference (vertical difference) between observed data and the fitted values (predicted values)

Equation of a line

\[ \begin{align} y & = mx + c\\ Y_i &= (b_0 + b_1X_i) + \epsilon_i\\ outcome_i & = (model) + error_i\\ y_i & = (intercept + slope*x_i) + error_i \end{align} \]
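The equation above can be illustrated by simulating data from a known intercept and slope and letting lm() recover them (a sketch with made-up values of \(b_0\) and \(b_1\), not the lifetime data):

```r
# simulate data with a known intercept and slope
set.seed(1954)
b0 <- 200   # true intercept
b1 <- 10    # true slope
x  <- rnorm(500)
y  <- b0 + b1 * x + rnorm(500, sd = 5)  # outcome = (model) + error

# fit the model and inspect the coefficients
fit <- lm(y ~ 1 + x)
coef(fit)   # estimates should be close to 200 and 10
```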

4 Linear model: first-pass reading time

pj <- position_jitter(0.2)
pd <- position_dodge(0.2)
pjd <- position_jitterdodge(0.2)
df_crit_verb %>% 
  filter(region=="verb") %>% 
  ggplot(aes(x = lifetime, y = fp, colour = lifetime)) +
  geom_point(position = pj) +
  theme(legend.position = "none") +
  theme_bw()

4.1 Plotting the data points

  • we want to plot first-pass reading times (fp) at the verb region
  • we will start with one predictor: lifetime (dead, living)
    • but we need it to be numerical
  • let’s create a new variable lifetime_n here:
    • living = -0.5
    • dead = +0.5
df_crit_verb <-
  df_crit_verb |>
  mutate(
    lifetime_n = ifelse(lifetime == "living", -0.5, +0.5),
    tense_n = ifelse(tense == "PP", -0.5, +0.5) # presumably tense, not lifetime
  ) 
fig_fp_scat <- 
  df_crit_verb %>% 
  filter(region=="verb") %>% 
  ggplot(aes(x = lifetime, y = fp, colour = lifetime)) +
  geom_point(position = pj) +
  theme(legend.position = "none") +
  theme_bw()
fig_fp_line <- 
  df_crit_verb %>% 
  filter(region=="verb") %>% 
  ggplot(aes(x = lifetime_n, y = fp, colour = lifetime_n)) +
  # geom_point(aes(x=lifetime), position = pj) +
  geom_smooth(aes(lifetime_n), method = "lm", se= F) +
  scale_x_continuous(breaks = seq(-0.5,+0.5,by=.5)) +
  theme(legend.position = "none") +
  theme_bw()
fig_fp_line1 <- 
  df_crit_verb %>% 
  filter(region=="verb") %>% 
  mutate(lifetime_n = ifelse(lifetime_n==-0.5,-1,1)) %>% 
  ggplot(aes(x = lifetime_n, y = fp, colour = lifetime_n)) +
  # geom_point(aes(x=lifetime), position = pj) +
  geom_smooth(aes(lifetime_n), method = "lm", se= F) +
  scale_x_continuous(breaks = seq(-1,+1,by=.5)) +
  theme(legend.position = "none") +
  theme_bw()
fig_fp_scat_line <- 
  df_crit_verb %>% 
  filter(region=="verb") %>% 
  ggplot(aes(x = lifetime_n, y = fp)) +
  geom_point(position = pj, aes(colour = as_factor(lifetime_n))) +
  geom_smooth(aes(as.numeric(lifetime_n)), method = "lm", se= F) +
  scale_x_continuous(breaks = seq(-0.5,+0.5,by=.5)) +
  theme(legend.position = "none") +
  theme_bw()
(fig_fp_line + fig_fp_line1) /
  (fig_fp_scat + fig_fp_scat_line)

4.2 lm()

  • the lm() function fits simple linear models
    • as arguments it takes a formula and a dataset, at minimum

\[ outcome \sim 1 + predictor \]

  • lm() function formula syntax can be read as: ff predicted by the intercept (1 is a placeholder for the intercept)
    • the intercept is included by default
    • if you omit the 1, the intercept is still included in the formula
    • if you wanted to remove the intercept (which you often don’t), you could replace 1 with 0
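A minimal sketch of the intercept-only formula (with made-up reading times): the single coefficient lm() estimates is just the sample mean of the outcome.

```r
# an intercept-only model simply estimates the mean of the outcome
y <- c(180, 210, 195, 205, 190)  # made-up reading times (ms)
fit0 <- lm(y ~ 1)

coef(fit0)  # (Intercept) = 196
mean(y)     # 196
```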

4.2.1 Running a model

fit_fp_1 <- lm(ff ~ 1, data = df_crit_verb) 

4.2.2 Model output

  • printing just the model gives us the formula and the coefficients
fit_fp_1

Call:
lm(formula = ff ~ 1, data = df_crit_verb)

Coefficients:
(Intercept)  
      199.1  

  • we typically use the summary() function to print full model outputs
summary(fit_fp_1)

Call:
lm(formula = ff ~ 1, data = df_crit_verb)

Residuals:
    Min      1Q  Median      3Q     Max 
-117.09  -37.09  -12.09   26.41  261.91 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  199.094      2.466   80.73 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.47 on 542 degrees of freedom

tidy(fit_fp_1)
# A tibble: 1 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     199.      2.47      80.7 2.82e-304

4.2.3 Summary

summary(fit_fp_1)

Call:
lm(formula = ff ~ 1, data = df_crit_verb)

Residuals:
    Min      1Q  Median      3Q     Max 
-117.09  -37.09  -12.09   26.41  261.91 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  199.094      2.466   80.73 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.47 on 542 degrees of freedom

  1. Call: repetition of the model formula
  2. Residuals: differences between observed values and those predicted by the model
  3. Coefficients column names: Estimate, Std. Error, t value, p-value (Pr(>|t|))
  4. Intercept (\(b_0\)): the value of \(y\) (first fix.) when \(x = 0\); in an intercept-only model, the mean first-fixation duration
  5. Significance codes
  6. Residual standard error: the standard deviation of the residuals, a measure of model fit (smaller values mean the observations sit closer to the fitted values)

4.2.4 Intercept

  • the intercept is essentially the mean
# print model intercept?
coef(fit_fp_1)['(Intercept)']
(Intercept) 
   199.0939 
# print data mean
mean(df_crit_verb$ff)
[1] 199.0939

  • in the output, the intercept seems to be significant (indicated with a low p-value, and ***)
    • what does this mean?

Significance

4.3 Adding a fixed effect (slope)

  • now let’s include a slope
  • the slope represents the change in \(y\) (DV: ff) when we move 1 unit along \(x\) (IV: lifetime)
  • in other words, it tells us the effect our IV has on the DV
    • what is the change in first-fixation times when we move from dead to living referents?
  • lifetime is categorical, how can we move 1 unit?
    • a linear model requires \(x\) and \(y\) to be numerical, so it simply codes factor levels as 0 and 1 by default (in alphabetical order)
    • so the slope represents the difference between categorical conditions
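The point above can be sketched with simulated data (the group names mirror lifetime, but the numbers are made up): with R's default 0/1 treatment coding, the slope is exactly the difference between the two group means.

```r
# with default treatment coding (dead = 0, living = 1),
# the slope equals the difference between the two group means
set.seed(1)
dat <- data.frame(
  lifetime = rep(c("dead", "living"), each = 50),
  ff       = c(rnorm(50, mean = 200, sd = 20), rnorm(50, mean = 195, sd = 20))
)
fit <- lm(ff ~ lifetime, data = dat)

diff_means <- mean(dat$ff[dat$lifetime == "living"]) -
              mean(dat$ff[dat$lifetime == "dead"])

all.equal(unname(coef(fit)["lifetimeliving"]), diff_means)  # TRUE
```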

4.3.1 Factors as numbers

  • when we code categorical data as numerical values, this is called contrast coding
# check contrasts
contrasts(df_crit_verb$lifetime)
       living
dead        0
living      1

4.4 Fit model (treatment contrasts)

# fit simple linear model
fit_ff_treat <- df_crit_verb %>%
  filter(ff > 0) %>%
  lm(ff ~ lifetime, data = .)
# alternatively
fit_ff_treat <- lm(ff ~ lifetime, 
            data = df_crit_verb, subset = ff > 0)

4.4.1 Model summary

summary(fit_ff_treat)

Call:
lm(formula = ff ~ lifetime, data = df_crit_verb, subset = ff > 
    0)

Residuals:
    Min      1Q  Median      3Q     Max 
-118.78  -37.78  -11.39   25.22  264.61 

Coefficients:
               Estimate Std. Error t value            Pr(>|t|)    
(Intercept)     201.783      3.484  57.920 <0.0000000000000002 ***
lifetimeliving   -5.388      4.931  -1.093               0.275    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.46 on 541 degrees of freedom
Multiple R-squared:  0.002202,  Adjusted R-squared:  0.0003575 
F-statistic: 1.194 on 1 and 541 DF,  p-value: 0.275

4.4.2 Intercept

# print model intercept?
coef(fit_ff_treat)['(Intercept)']
(Intercept) 
   201.7831 
# print data mean
mean(df_crit_verb$ff)
[1] 199.0939
  • our intercept is no longer the grand mean of first-fixation times…what is it?

4.4.3 Intercept for treatment contrasts

  • what are the means of our two factor levels?
# print model intercept?
coef(fit_ff_treat)['(Intercept)']
(Intercept) 
   201.7831 
Code
# compute summary 
summary_ff_life <- df_crit_verb |> 
  filter(region=="verb",
         ff > 0) |> 
  group_by(lifetime) %>%
  summarise(N = n(),
            mean = mean(ff, na.rm = T),
            sd = sd(ff, na.rm = T))

knitr::kable(summary_ff_life, digits=3,
             caption = "Summary statistics for first-fixation duration at the verb region") %>% 
  kableExtra::kable_styling(font_size = 30,
                            position = "left")

Summary statistics for first-fixation duration at the verb region

| lifetime | N | mean | sd |
|---|---|---|---|
| dead | 272 | 201.783 | 55.197 |
| living | 271 | 196.395 | 59.639 |

4.4.4 Slope

  • what was our slope?
  • what does this correspond to?
# print model slope
coef(fit_ff_treat)['lifetimeliving']
lifetimeliving 
     -5.388254 
Code
# compute summary 
summary_ff_life <- df_crit_verb |> 
  filter(region=="verb",
         ff > 0) |> 
  group_by(lifetime) %>%
  summarise(N = n(),
            mean = mean(ff, na.rm = T),
            sd = sd(ff, na.rm = T))

knitr::kable(summary_ff_life, digits=3,
             caption = "Summary statistics for first-fixation duration at the verb region") %>% 
  kableExtra::kable_styling(font_size = 30,
                            position = "left")

Summary statistics for first-fixation duration at the verb region

| lifetime | N | mean | sd |
|---|---|---|---|
| dead | 272 | 201.783 | 55.197 |
| living | 271 | 196.395 | 59.639 |

4.4.5 Slope

  • what was our slope?
  • what does this correspond to?
# mean(dead) - mean(living)
201.783 - 196.395
[1] 5.388
Code
# compute summary 
summary_ff_life <- df_crit_verb |> 
  filter(region=="verb",
         ff > 0) |> 
  group_by(lifetime) %>%
  summarise(N = n(),
            mean = mean(ff, na.rm = T),
            sd = sd(ff, na.rm = T))

knitr::kable(summary_ff_life, digits=3,
             caption = "Summary statistics for first-fixation duration at the verb region") %>% 
  kableExtra::kable_styling(font_size = 30,
                            position = "left")

Summary statistics for first-fixation duration at the verb region

| lifetime | N | mean | sd |
|---|---|---|---|
| dead | 272 | 201.783 | 55.197 |
| living | 271 | 196.395 | 59.639 |

  • intercept: value of \(y\) when \(x\) = 0

  • slope: difference in \(y\) with a 1-unit change of \(x\)

  • treatment contrasts: factor levels are coded as 0 and 1 (alphabetically)

    • so when our predictor (\(x\)) = 0, this is the first alphabetical level of our factor (in our case, dead)
    • so our intercept will be the mean ff for the lifetime level dead

4.5 Method of least squares

  • how did lm() choose this particular line (namely, the slope)?
    • intercept = grand mean of observed data (with sum contrasts)
    • slope = predicted change in \(y\) over \(x\)
  • the procedure that finds the line that best fits the data is called the method of least squares
    • minimises the sum of squares (residuals are squared and summed)
  • method of least squares
    • find the line that has the lowest sum of squares
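The method of least squares can be demonstrated with simulated data (the numbers are made up): the lm() fit attains the smallest sum of squared residuals, and nudging the slope away from the estimate always increases it.

```r
# the regression line minimises the sum of squared residuals
set.seed(2)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

# sum of squared residuals for a candidate intercept/slope pair
sum_sq <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

ss_lm    <- sum_sq(coef(fit)[1], coef(fit)[2])        # lm()'s line
ss_other <- sum_sq(coef(fit)[1], coef(fit)[2] + 0.5)  # nudged slope

ss_lm < ss_other  # TRUE: any other slope does worse
```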

4.6 Changing our contrasts

  • it sometimes makes more sense for the intercept to represent the grand mean

    • to do this, we want 0 to be between our two factor levels
    • e.g., change the contrasts to -0.5 and +0.5
  • this is called sum coding

  • first, order our predictor

    • we predict longer reading times for dead versus living, so order living dead
    • this could also have been done with treatment contrasts (0,1)

4.7 Changing our contrasts

# order factor levels
df_crit_verb$lifetime <- factor(df_crit_verb$lifetime, levels = c("living","dead"))

# set contrasts
contrasts(df_crit_verb$lifetime) <- c(-0.5,+0.5)
# print contrasts
contrasts(df_crit_verb$lifetime)
       [,1]
living -0.5
dead    0.5
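A self-contained sketch of what this buys us (simulated data with made-up means, mirroring the lifetime factor): with ±0.5 sum contrasts and balanced groups, the model intercept is the mean of the two condition means.

```r
# with ±0.5 sum contrasts, the intercept is the grand mean of the condition means
set.seed(3)
dat <- data.frame(
  lifetime = factor(rep(c("living", "dead"), each = 40),
                    levels = c("living", "dead")),
  ff       = c(rnorm(40, 196, 30), rnorm(40, 202, 30))
)
contrasts(dat$lifetime) <- c(-0.5, +0.5)
fit <- lm(ff ~ lifetime, data = dat)

cond_means <- tapply(dat$ff, dat$lifetime, mean)  # per-condition means
all.equal(unname(coef(fit)["(Intercept)"]), mean(cond_means))  # TRUE
```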

4.8 Fit model (sum contrasts)

# fit simple linear model
fit_ff <- df_crit_verb %>%
  filter(ff > 0) %>%
  lm(ff ~ lifetime, data = .)

4.8.1 Coefficients table with summary()

summary(fit_ff)

Call:
lm(formula = ff ~ lifetime, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-118.78  -37.78  -11.39   25.22  264.61 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  199.089      2.466  80.743   <2e-16 ***
lifetime1      5.388      4.931   1.093    0.275    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.46 on 541 degrees of freedom
Multiple R-squared:  0.002202,  Adjusted R-squared:  0.0003575
F-statistic: 1.194 on 1 and 541 DF,  p-value: 0.275

  1. Call: the model formula
  2. Residuals: differences between observed values and those predicted by the model
  3. Coefficients column names: Estimate, Std. Error (SE), t value, p-value (Pr(>|t|))
  4. Intercept (\(b_0\)): the value of \(y\) (first fix.) when \(x = 0\); with our sum contrasts, the grand mean
  5. Slope (\(b_1\)): the change in first fixation going from living (−0.5) to dead (+0.5)
  6. Model fit: \(R^2\), and an F-test comparing the model to an intercept-only model
  • what is the intercept?
  • is the slope positive or negative?
    • what is its value?
  • this is what the slope would look like:

4.8.2 Understanding the summary

  • let’s compute summary statistics based on lifetime
    • then compare this to the model output

Exercises

  1. Subtract the mean first-fixation reading time of living from that of dead
    • what does this correspond to in the model summary?
  2. Compute the mean of the dead and living means
    • what does this correspond to in the model summary?
  3. Divide the slope by 2. Subtract this from the mean of dead.
    • what does this correspond to?

Summary statistics

Code
# compute summary 
summary_ff_life <- df_crit_verb |> 
  filter(region=="verb",
         ff > 0) |> 
  group_by(lifetime) %>%
  summarise(N = n(),
            mean = mean(ff, na.rm = T),
            sd = sd(ff, na.rm = T)) %>%
  # compute standard error, confidence intervals, and lower/upper ci bounds
  mutate(se = sd / sqrt(N),
         ci = qt(1 - (0.05 / 2), N - 1) * se,
         lower.ci = mean - qt(1 - (0.05 / 2), N - 1) * se,
         upper.ci = mean + qt(1 - (0.05 / 2), N - 1) * se)

knitr::kable(summary_ff_life, digits=3,
             caption = "Summary statistics for first-fixation duration at the verb region") %>% 
  kableExtra::kable_styling(font_size = 24,
                            position = "left")

Summary statistics for first-fixation duration at the verb region

| lifetime | N | mean | sd | se | ci | lower.ci | upper.ci |
|---|---|---|---|---|---|---|---|
| living | 271 | 196.395 | 59.639 | 3.623 | 7.133 | 189.262 | 203.527 |
| dead | 272 | 201.783 | 55.197 | 3.347 | 6.589 | 195.194 | 208.372 |

Model summary

summary(fit_ff)

Call:
lm(formula = ff ~ lifetime, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-118.78  -37.78  -11.39   25.22  264.61 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  199.089      2.466  80.743 <0.0000000000000002 ***
lifetime1      5.388      4.931   1.093               0.275    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.46 on 541 degrees of freedom
Multiple R-squared:  0.002202,  Adjusted R-squared:  0.0003575 
F-statistic: 1.194 on 1 and 541 DF,  p-value: 0.275

4.9 Slope

  • intercept = value of \(y\) when \(x = 0\)

  • treatment contrasts: factor levels are coded as 0 and 1

    • intercept is mean of level that is coded as 0
  • sum contrast coding: factor levels are coded as ±0.5 (or ±1)

    • when \(x = 0\), this is the mid-way point between our two predictor levels
    • so the intercept will be the grand mean of our two levels
  • our slope is unchanged, however (unless we set our sum contrasts to ±1, which some people do; the levels are then 2 units apart, so the slope is halved)
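The ±0.5 vs. ±1 point can be verified with simulated data (made-up means, mirroring the lifetime conditions): under ±1 coding the levels are 2 units apart, so the slope comes out at half the ±0.5 slope.

```r
# same data, two numeric codings of the same two-level factor
set.seed(4)
ff   <- c(rnorm(40, 196, 30), rnorm(40, 202, 30))
life <- rep(c("living", "dead"), each = 40)

x_half <- ifelse(life == "living", -0.5, +0.5)  # levels 1 unit apart
x_one  <- ifelse(life == "living", -1,   +1)    # levels 2 units apart

b_half <- coef(lm(ff ~ x_half))[["x_half"]]
b_one  <- coef(lm(ff ~ x_one))[["x_one"]]

all.equal(b_one, b_half / 2)  # TRUE: the ±1 slope is half the ±0.5 slope
```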

5 Contrast coding

5.1 Comparing models

anova(fit_ff, fit_fp_1)
Analysis of Variance Table

Model 1: ff ~ lifetime
Model 2: ff ~ 1
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    541 1786013                           
2    542 1789954 -1   -3941.3 1.1938  0.275

6 Exploring our model

  • the linear model contains fitted values corresponding to our observed values
    • these fitted values lie on a straight line
    • our observed values do not lie on a straight line
    • the residuals are the vertical differences (along the \(y\) axis) between the fitted (predicted) values and the observed values

Exploring the model

# how many observed values did we enter into the model?
df_crit_verb |> 
  filter(ff > 0) |> 
  nrow()
[1] 543
# how many fitted values does the model contain?
length(fitted(fit_ff))
[1] 543

Exploring the model: residuals

# what do our FITTED values look like?
head(fitted(fit_ff))
       1        2        3        4        5        6 
196.3948 196.3948 201.7831 196.3948 201.7831 196.3948 
# what do our OBSERVED values look like?
head(df_crit_verb$ff)
[1] 175 207 228 231 212 168
# what is the difference between the FITTED and OBSERVED values?
head(df_crit_verb$ff) - head(fitted(fit_ff))
        1         2         3         4         5         6 
-21.39483  10.60517  26.21691  34.60517  10.21691 -28.39483 
# what are our RESIDUALS?
head(residuals(fit_ff))
        1         2         3         4         5         6 
-21.39483  10.60517  26.21691  34.60517  10.21691 -28.39483 

Exploring the model

  • what were our coefficients?
coef(fit_ff)
(Intercept)   lifetime1 
 199.088961    5.388254 
  • what is the mean of our predictor coded as -0.5?
coef(fit_ff)['(Intercept)'] + coef(fit_ff)['lifetime1'] * -0.5
(Intercept) 
   196.3948 
  • ignore the (Intercept) label here, R just takes the first label when performing an operation on 2 vectors

  • what is the mean of our predictor coded as +0.5?

coef(fit_ff)['(Intercept)'] + coef(fit_ff)['lifetime1'] * 0.5
(Intercept) 
   201.7831 

7 Assumptions

  • refer to residuals
    • i.e., the difference between the observed and the fitted (predicted) values
  • normality assumption
    • residuals of the model are (approximately) normally distributed
  • constant variance assumption (homoscedasticity)
    • spread of residuals should be (approximately) equal along the regression line
  • absence of collinearity
  • independence

7.1 Normality assumption

  • can be inspected e.g., with a histogram or Q-Q plot
df_crit_verb |> 
  filter(ff > 0) |> 
  mutate(half = if_else(trial >= 104, "2nd","1st")) |> 
  ggpubr::ggqqplot(x = "ff")

Normality assumption

  • how about by participant and experimental half?
df_crit_verb |> 
  filter(ff > 0) |> 
  mutate(half = if_else(trial >= 104, "2nd","1st")) |> 
  ggpubr::ggqqplot(x = "ff",
                    color = "half",
                    facet.by = "px")

Normality assumption

  • density plot (of residuals)
plot(density(resid(fit_ff)))

Normality assumption

  • residual plot (fitted vs. residual values): not very helpful for binomial data!
plot(fitted(fit_ff),residuals(fit_ff))

Normality assumption

  • reading time data tend to be positively skewed
    • so the residuals also tend to be positively skewed
  • data with a skewed distribution do not meet the normality assumption
  • a fix: nonlinear transformations
    • the most common: the log transformation
  • log-transforming your data makes larger numbers smaller, and compresses large values more than small ones
    • the difference between larger numbers therefore shrinks
    • this can make positively skewed data (approximately) normally distributed
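A quick illustration of the compression (with made-up millisecond values): both pairs below differ by 100 ms on the raw scale, but after log-transforming, the gap between the two large values is much smaller than the gap between the two small values.

```r
# raw scale: both pairs differ by 100 ms
c(raw_small = 200 - 100, raw_large = 1000 - 900)

# log scale: the gap between large values shrinks far more
c(log_small = log(200)  - log(100),   # log(2)    ~ 0.693
  log_large = log(1000) - log(900))   # log(10/9) ~ 0.105
```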

7.1.1 Log transformation

  • for more see Section 5.4 in Winter (2019)

  • the R function log() computes the ‘natural logarithm’ (and is the inverse of the exponential exp())

  • log() makes large numbers smaller

  • exp() makes small numbers larger

log(0)
[1] -Inf
log(1:10)
 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 [8] 2.0794415 2.1972246 2.3025851
log(c(10,20,30,40,100))
[1] 2.302585 2.995732 3.401197 3.688879 4.605170
exp(1:10)
 [1]     2.718282     7.389056    20.085537    54.598150   148.413159
 [6]   403.428793  1096.633158  2980.957987  8103.083928 22026.465795
exp(log(1:10))
 [1]  1  2  3  4  5  6  7  8  9 10

7.2 Fit model (log-transformed)

  • continuous variables truncated at 0 typically have a positive skew
    • a lot of small values (e.g., tt < 500 ms), with some larger values (e.g., tt > 1000 ms)
    • this usually means our residuals are also positively skewed, i.e., not normally distributed
  • so we typically log-transform raw reading/reaction times for our linear models
# fit simple linear model with log
fit_ff_log <- df_crit_verb %>%
  filter(ff > 0) %>% # important! you can't log transform 0
  lm(log(ff) ~ lifetime, data = .)
summary(fit_ff_log)

Call:
lm(formula = log(ff) ~ lifetime, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85242 -0.17140 -0.01899  0.15691  0.89550 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  5.25458    0.01197 439.064 <0.0000000000000002 ***
lifetime1    0.03336    0.02394   1.394               0.164    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2789 on 541 degrees of freedom
Multiple R-squared:  0.003579,  Adjusted R-squared:  0.001737 
F-statistic: 1.943 on 1 and 541 DF,  p-value: 0.1639

7.2.1 Check assumptions

Raw first-fixation times

plot(density(resid(fit_ff)))

Log first-fixation times

plot(density(resid(fit_ff_log)))

Check assumptions

Raw first-fixation times

qqnorm(residuals(fit_ff))
qqline(residuals(fit_ff), col="red")

Log first-fixation times

qqnorm(residuals(fit_ff_log))
qqline(residuals(fit_ff_log), col="red")

8 Communicating your results

  • model summaries can be provided via tables and/or figures
    • you should always report the t-values and p-values of an effect
df_crit_verb |> 
  filter(ff > 0) |> 
  mutate(log_ff = log(ff)) |> 
  mutate(half = if_else(trial >= 104, "2nd","1st")) |> 
  ggpubr::ggqqplot(x = "log_ff")

df_crit_verb |> 
  filter(ff > 0) |> 
  mutate(log_ff = log(ff)) |> 
  mutate(half = if_else(trial >= 104, "2nd","1st")) |> 
  ggpubr::ggqqplot( x = "log_ff",
                    color = "half")

df_crit_verb |> 
  filter(ff > 0) |> 
  mutate(log_ff = log(ff)) |> 
  mutate(half = if_else(trial >= 104, "2nd","1st")) |> 
  ggpubr::ggqqplot( x = "log_ff",
                    color = "half",
                    facet.by = "px")

plot(density(resid(fit_ff)))

9 Summary

  • we saw that the equation for a straight line boils down to its intercept and slope

  • we fit our first linear model with a categorical predictor

  • we looked at some assumptions of linear models and how to check them

  • next, we’ll look at a case with more than one predictor: multiple regression

Important terms

dependent variable (DV): outcome, measure; \(y\)
independent variable (IV): predictor, fixed effect; \(x\)
equation for a straight line: \(y_i = b_0 + b_1x_i + \epsilon_i\)
Simple regression: predicting the outcome (DV) from a single predictor (IV)
Slope: change in \(y\) (DV) associated with a 1-unit change in \(x\) (IV)
Intercept: value of \(y\) (DV) when \(x\) (IV) = 0
Normality assumption
Residuals
Coefficients
Log-transformation

Important functions

lm(dv ~ 1 + iv, data = df_name) simple linear model
summary(model) print model summary
coef(model) print coefficients (intercept, slope)
log() log-transform a continuous variable

Session Info

Code
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] magick_2.7.4     kableExtra_1.3.4 knitr_1.42       patchwork_1.1.2 
 [5] broom_1.0.4      lubridate_1.9.2  forcats_1.0.0    stringr_1.5.0   
 [9] dplyr_1.1.2      purrr_1.0.1      readr_2.1.4      tidyr_1.3.0     
[13] tibble_3.2.1     ggplot2_3.4.2    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] gtable_0.3.3      xfun_0.39         htmlwidgets_1.6.2 rstatix_0.7.2    
 [5] lattice_0.21-8    tzdb_0.4.0        vctrs_0.6.2       tools_4.3.0      
 [9] generics_0.1.3    curl_5.0.0        parallel_4.3.0    fansi_1.0.4      
[13] highr_0.10        pacman_0.5.1      pkgconfig_2.0.3   Matrix_1.5-4     
[17] webshot_0.5.4     lifecycle_1.0.3   farver_2.1.1      compiler_4.3.0   
[21] rbbt_0.0.0.9000   munsell_0.5.0     carData_3.0-5     htmltools_0.5.5  
[25] yaml_2.3.7        car_3.1-2         ggpubr_0.6.0      pillar_1.9.0     
[29] crayon_1.5.2      abind_1.4-5       nlme_3.1-162      tidyselect_1.2.0 
[33] rvest_1.0.3       digest_0.6.31     stringi_1.7.12    labeling_0.4.2   
[37] splines_4.3.0     cowplot_1.1.1     rprojroot_2.0.3   fastmap_1.1.1    
[41] grid_4.3.0        here_1.0.1        colorspace_2.1-0  cli_3.6.1        
[45] magrittr_2.0.3    utf8_1.2.3        withr_2.5.0       scales_1.2.1     
[49] backports_1.4.1   bit64_4.0.5       timechange_0.2.0  rmarkdown_2.21   
[53] httr_1.4.6        bit_4.0.5         ggsignif_0.6.4    png_0.1-8        
[57] hms_1.1.3         evaluate_0.21     viridisLite_0.4.2 mgcv_1.8-42      
[61] rlang_1.1.1       Rcpp_1.0.10       glue_1.6.2        xml2_1.3.4       
[65] svglite_2.1.1     rstudioapi_0.14   vroom_1.6.3       jsonlite_1.8.4   
[69] R6_2.5.1          systemfonts_1.0.4 fs_1.6.2         

References

DeBruine, L. M., & Barr, D. J. (2021). Understanding Mixed-Effects Models Through Data Simulation. Advances in Methods and Practices in Psychological Science, 4(1), 251524592096511. https://doi.org/10.1177/2515245920965119
Field, A., Miles, J., & Field, Z. (2013). Discovering statistics using R. Sage.
Winter, B. (2013). Linear models and linear mixed effects models in R: Tutorial 1. https://bodowinter.com/tutorial/bw_LME_tutorial1.pdf
Winter, B. (2014). A very basic tutorial for performing linear mixed effects analyses (Tutorial 2). https://bodowinter.com/tutorial/bw_LME_tutorial2.pdf
Winter, B. (2019). Statistics for Linguists: An Introduction Using R. Routledge. https://doi.org/10.4324/9781315165547